30 research outputs found

    Fast rates for empirical vector quantization

    We consider the rate of convergence of the expected loss of empirically optimal vector quantizers. Earlier results show that the mean-squared expected distortion, for any fixed distribution supported on a bounded set and satisfying some regularity conditions, decreases at the rate O(log n/n). We prove that this rate is actually O(1/n). Although these conditions are hard to check, we show that well-polarized distributions with continuous densities supported on a bounded set are included in the scope of this result.
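
    A minimal numerical sketch of the quantity whose rate the abstract discusses: the expected distortion of the empirically optimal quantizer, estimated on a fresh sample. The uniform data, the choice k = 8, and the use of scikit-learn's KMeans as a stand-in for exact empirical risk minimization are assumptions of this illustration, not the paper's setup.

        # Empirical distortion of a k-point quantizer fitted on n samples,
        # evaluated on an independent sample to approximate the expected loss.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)

        def distortion(points, codebook):
            # Mean squared distance from each point to its nearest codepoint.
            d2 = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            return d2.min(axis=1).mean()

        k = 8
        for n in (100, 1000, 10000):
            sample = rng.uniform(size=(n, 2))      # bounded support, as in the abstract
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample)
            test = rng.uniform(size=(100000, 2))   # fresh sample estimates the expectation
            print(n, distortion(test, km.cluster_centers_))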

    Quantization/clustering: when and why does k-means work?

    Though mostly used as a clustering algorithm, k-means was originally designed as a quantization algorithm: it aims at providing a compression of a probability distribution with k points. Building upon [21, 33], we investigate how and when these two approaches are compatible. Namely, we show that, provided the sample distribution satisfies a margin-like condition (in the sense of [27] for supervised learning), both the associated empirical risk minimizer and the output of Lloyd's algorithm provide almost optimal classification in certain cases (in the sense of [6]). Besides, we also show that they achieve fast and optimal convergence rates in terms of sample size and compression risk.
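
    The two readings of k-means contrasted above can be made concrete in a few lines: the fitted centers are the k-point compression, the induced Voronoi cells are the clustering. A minimal sketch, assuming a two-component Gaussian mixture; the paper's margin-like condition is a property of the sampling distribution that this snippet does not check.

        # k-means output read two ways: codebook (quantization) and labels (clustering).
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(1)
        sample = np.vstack([rng.normal(loc=-3, size=(500, 2)),
                            rng.normal(loc=+3, size=(500, 2))])

        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample)

        codebook = km.cluster_centers_                # 2-point compression of the law
        labels = km.labels_                           # assignment to Voronoi cells
        compression_risk = km.inertia_ / len(sample)  # empirical distortion
        print(codebook, compression_risk)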

    A k-point distance function for robust geometric inference

    Analyzing the sub-level sets of the distance to a compact sub-manifold of R^d is a common method in topological data analysis to understand its topology. Topological inference procedures therefore usually rely on a distance estimate based on n sample points [41]. When the sample points are corrupted by noise, the distance-to-measure function (DTM, [16]) is a surrogate for the distance-to-compact-set function. In practice, computing the homology of its sub-level sets requires computing the homology of unions of n balls ([28, 14]), which may become intractable when n is large. To simultaneously address the two problems of a large number of points and noise, we introduce the k-power-distance-to-measure function (k-PDTM). This new approximation of the distance-to-measure may be thought of as a k-point-based approximation of the DTM. Its sub-level sets consist of unions of k balls, and this distance is also proved robust to noise. We assess the quality of this approximation for k possibly drastically smaller than n, and provide an algorithm to compute the k-PDTM from a sample. Numerical experiments illustrate the good behavior of this k-point approximation in a noisy topological inference framework.
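
    A short sketch of the two functions involved, under illustrative assumptions: the empirical DTM (root mean squared distance to the q nearest sample points, for mass parameter m = q/n), and a generic k-point power distance, which is the functional form the k-PDTM takes (its sub-level sets are unions of k balls). How the paper selects the k centers and weights is not reproduced here; the arrays below are placeholders.

        # Empirical DTM and a generic k-point power distance (assumes q > 1).
        import numpy as np
        from scipy.spatial import cKDTree

        def dtm(query, sample, q):
            # Empirical DTM with mass parameter m = q/n: root mean squared
            # distance from each query point to its q nearest sample points.
            dists, _ = cKDTree(sample).query(query, k=q)
            return np.sqrt((dists ** 2).mean(axis=1))

        def power_distance(query, centers, weights):
            # min_j ( ||x - c_j||^2 + w_j ): sub-level sets are unions of k balls.
            d2 = ((query[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            return np.sqrt(np.maximum(d2 + weights[None, :], 0.0).min(axis=1))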

    Non Asymptotic Bounds for Vector Quantization in Hilbert Spaces

    Recent results in quantization theory show that the convergence rate of the mean-squared expected distortion of the empirical risk minimizer strategy, for any fixed probability distribution satisfying some regularity conditions, is O(1/n), where n is the sample size. However, the dependency of the average distortion on other parameters is not known. This paper offers more general conditions, which may be thought of as margin conditions, under which a sharp upper bound on the expected distortion rate of the empirically optimal quantizer is derived. This upper bound is also proved to be sharp with respect to the dependency of the distortion on other natural parameters of the quantization problem. (Technical proofs are omitted and can be found in the related unpublished paper "Margin conditions for vector quantization".)
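
    In standard quantization notation, the setting can be restated as follows; the exact constants and the precise margin conditions are in the paper.

        % Hedged restatement: distortion of a codebook c = (c_1, ..., c_k),
        % its empirical minimizer, and the rate claimed in the abstract.
        \[
          R(\mathbf{c}) \;=\; \mathbb{E}\,\min_{1 \le j \le k} \lVert X - c_j \rVert^2,
          \qquad
          \hat{\mathbf{c}}_n \;\in\; \operatorname*{arg\,min}_{\mathbf{c}}
          \frac{1}{n}\sum_{i=1}^{n}\,\min_{1 \le j \le k} \lVert X_i - c_j \rVert^2,
        \]
        \[
          \mathbb{E}\,R(\hat{\mathbf{c}}_n) \;-\; \inf_{\mathbf{c}} R(\mathbf{c})
          \;=\; O\!\left(\frac{1}{n}\right)
          \quad \text{under the paper's margin-type conditions.}
        \]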

    Optimal quantization of the mean measure and applications to statistical learning

    This paper addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold: first, we intend to approximate, with a compactly supported measure, the mean of the measure-generating process, which coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this aim we provide two algorithms that we prove to be almost minimax optimal. Second, we build from the estimator of the mean measure a vectorization map that sends every measure into a finite-dimensional Euclidean space, and investigate its properties through a clustering-oriented lens. In a nutshell, we show that for a mixture of measure-generating processes, our technique yields a representation in $\mathbb{R}^k$, for $k \in \mathbb{N}^*$, that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in [Royer19].
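
    One plausible reading of the pipeline, not the paper's two algorithms: approximate the mean measure by quantizing the pooled support points of the observed measures, then send each measure to R^k via the mass it places in each codepoint's cell. The k-means quantizer and the cell-mass vectorization below are illustrative choices.

        # Quantize the mean measure, then vectorize each measure in R^k.
        import numpy as np
        from sklearn.cluster import KMeans

        def fit_codebook(measures, k):
            # Pool the support points of all observed discrete measures and quantize.
            pooled = np.vstack(measures)
            return KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled)

        def vectorize(measure, codebook_model):
            # Coordinate j = fraction of the measure's points in codepoint j's cell.
            cells = codebook_model.predict(measure)
            k = codebook_model.n_clusters
            return np.bincount(cells, minlength=k) / len(measure)

    The resulting vectors can then be fed to any Euclidean clustering method, which is the clustering-oriented use the abstract describes.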

    The k-PDTM: a coreset for geometric inference

    Analyzing the sub-level sets of the distance to a compact sub-manifold of R^d is a common method in topological data analysis to understand its topology. The distance-to-measure (DTM) was introduced by Chazal, Cohen-Steiner and Mérigot in [7] to address the non-robustness of the distance to a compact set to noise and outliers. This function makes possible the inference of the topology of a compact subset of R^d from a noisy cloud of n points lying nearby in the Wasserstein sense. In practice, these sub-level sets may be computed using approximations of the DTM such as the q-witnessed distance [10] or other power distances [6]. These approaches eventually amount to computing the homology of unions of n growing balls, which may become intractable when n is large. To simultaneously address the two problems of a large number of points and noise, we introduce the k-power-distance-to-measure (k-PDTM). This new approximation of the distance to measure may be thought of as a k-coreset-based approximation of the DTM. Its sub-level sets consist of unions of k balls, with k << n, and this distance is also proved robust to noise. We assess the quality of this approximation for k possibly dramatically smaller than n; for instance, k = n^{1/3} is proved to be optimal for 2-dimensional shapes. We also provide an algorithm to compute the k-PDTM.
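
    The abstract mentions an algorithm to compute the k-PDTM; below is a Lloyd-type sketch in its spirit, not a transcription of the paper's procedure: each center carries a local mean and a weight computed from its q nearest sample points (q > 1 assumed), sample points are reassigned by power distance, and centers are recentered. Initialization and stopping rules are simplifications.

        # Lloyd-type iteration producing k weighted centers for a power distance.
        import numpy as np
        from scipy.spatial import cKDTree

        def k_pdtm_centers(sample, k, q, n_iter=20, seed=0):
            rng = np.random.default_rng(seed)
            centers = sample[rng.choice(len(sample), size=k, replace=False)]
            tree = cKDTree(sample)
            for _ in range(n_iter):
                # Local mean and weight of each center from its q nearest neighbours.
                _, idx = tree.query(centers, k=q)
                local = sample[idx]                       # shape (k, q, d)
                means = local.mean(axis=1)                # shape (k, d)
                weights = ((local - means[:, None, :]) ** 2).sum(axis=2).mean(axis=1)
                # Reassign every sample point by power distance, then recenter.
                d2 = ((sample[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
                labels = (d2 + weights[None, :]).argmin(axis=1)
                for j in range(k):
                    cell = sample[labels == j]
                    if len(cell):
                        centers[j] = cell.mean(axis=0)
            # Means and weights of the last iteration define the power distance.
            return means, weights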

    ATOL: Measure Vectorisation for Automatic Topologically-Oriented Learning

    Robust topological information commonly comes in the form of a set of persistence diagrams, finite measures that are by nature difficult to affix to generic machine learning frameworks. We introduce a learnt, unsupervised measure vectorisation method and use it to reflect underlying changes in topological behaviour in machine learning contexts. Relying on optimal measure quantisation results, the method is tailored to efficiently discriminate important plane regions where meaningful differences arise. We showcase the strength and robustness of our approach on a number of applications, from competitive and modern graph collections, where the method reaches state-of-the-art performance, to a synthetic geometric problem of dynamical orbits. The proposed methodology comes with only high-level tuning parameters, such as the total measure encoding budget, and we provide completely open-access software.
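
    A sketch of an ATOL-style vectorisation under illustrative assumptions: learn a budget of b centers by quantizing the pooled points of all training diagrams, then encode each diagram with one coordinate per center. The Laplacian contrast and MiniBatchKMeans below are stand-ins chosen for this sketch; the paper's own contrast functions and its open-access software may differ.

        # Learn centers from pooled diagram points, then vectorise each diagram.
        import numpy as np
        from sklearn.cluster import MiniBatchKMeans

        def fit_centers(diagrams, budget):
            # Pool all (birth, death) points and quantize within the given budget.
            pooled = np.vstack(diagrams)
            km = MiniBatchKMeans(n_clusters=budget, random_state=0).fit(pooled)
            return km.cluster_centers_

        def vectorise(diagram, centers, scale=1.0):
            # One coordinate per center: total Laplacian-kernel mass near it.
            d = np.linalg.norm(diagram[:, None, :] - centers[None, :, :], axis=2)
            return np.exp(-d / scale).sum(axis=0)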